feat(repo): add scripts to synthesize and consume azl repodata #17139

reubeno wants to merge 1 commit into
Conversation
This looks good Reuben, thanks! Can you please add a few workflow examples and use cases? Mapping a command-line option to what it does took some effort. That could be a me problem, but examples can help quite a bit.
Adds three new tools under `scripts/repo/` for working with Azure Linux package repositories:

* `synthesize-repodata.py` — given one or more upstream repo prefixes (Standard Azure Linux Repo Layout: per-channel main/debuginfo/srpms sub-repos) and/or explicit per-repo overrides, synthesizes a fresh set of per-destination repodata trees that route each package to its intended channel based on azldev component metadata. Local repo overrides take precedence over upstream when NEVRAs collide (CLI order is preserved). The output is a static directory tree of standard `createrepo_c` repodata with absolute upstream URLs in package locations, so the synthesized repodata can be served from anywhere without needing to mirror the RPM content.
* `dnf-with-azl-repos` — thin wrapper around `dnf` that probes one or more URL prefixes for the Standard Azure Linux Repo Layout, enables every sub-repo it discovers (silently skipping ones that don't exist), and execs `dnf` with those repos added on the command line.
* `_repo_layout.py` — shared definition of the Standard Azure Linux Repo Layout (channels, sub-repo kinds, per-kind URL template) consumed by both scripts.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
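Since the reviewers asked for example invocations, here is a hedged sketch. Only `--repo-prefix` and `--repo` are visible in the review itself; the URLs are placeholders and the exact `--repo` spec syntax is an assumption, so check each script's `--help` for the real spelling.

```shell
# Hypothetical usage sketch -- URLs and the --repo spec syntax are
# placeholders, not the script's verified CLI. The output-directory
# flag is omitted here; see --help.
./scripts/repo/synthesize-repodata.py \
    --repo-prefix https://packages.example.com/azurelinux/3.0 \
    --repo https://my-build.example.com/local-repo/x86_64

# Probe a prefix for the standard layout and run dnf against whatever
# sub-repos exist there (missing ones are silently skipped).
./scripts/repo/dnf-with-azl-repos \
    --repo-prefix https://packages.example.com/azurelinux/3.0 \
    install some-package
```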
Pull request overview
Adds three new scripts under scripts/repo/ for synthesizing Azure Linux per-channel/per-arch repodata trees from upstream RPM repositories and for invoking dnf against discovered Azure Linux repos. The scripts share a common layout definition (_repo_layout.py) that encodes the fixed channel × kind × arch matrix.
Changes:
- New `synthesize-repodata.py` that fetches upstream repodata, queries `azldev package list` to assign packages to channels (with sibling-rpm inheritance fallback), and emits routed `createrepo_c` repodata referencing the original upstream URLs.
- New `dnf-with-azl-repos` wrapper that probes URL prefixes for the standard layout (silently skipping 404s) and execs `dnf` with the discovered sub-repos enabled.
- New `_repo_layout.py` shared module defining the standard `CHANNELS`, `KIND_*` constants, and `SUBREPOS` table.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `scripts/repo/synthesize-repodata.py` | Main synth tool: download repodata, build NEVRA universe, query azldev, decide routing per package, emit per-destination repodata + unpublished/fallback reports. |
| `scripts/repo/dnf-with-azl-repos` | Thin `dnf` wrapper: HEAD-probe sub-repos under each `--repo-prefix`, build `--repofrompath`/`--enablerepo` args, exec `dnf`. |
| `scripts/repo/_repo_layout.py` | Shared constants/dataclass describing the six standard sub-repos. |
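As a rough sketch of what the shared layout module encodes: only the names `CHANNELS`, `KIND_*`, and `SUBREPOS` come from the review, so the channel names, constant values, and URL template below are placeholders, not the actual `_repo_layout.py`.

```python
from dataclasses import dataclass

# Hypothetical reconstruction of the channel x kind matrix; channel
# names and the URL template are placeholders.
KIND_MAIN = "main"
KIND_DEBUGINFO = "debuginfo"
KIND_SRPMS = "srpms"

CHANNELS = ("base", "preview")  # placeholder channel names

@dataclass(frozen=True)
class SubRepo:
    channel: str
    kind: str

    def url(self, prefix: str, arch: str) -> str:
        # Placeholder per-kind URL template.
        return f"{prefix}/{self.channel}/{self.kind}/{arch}"

# 2 channels x 3 kinds = the six standard sub-repos the table mentions.
SUBREPOS = tuple(
    SubRepo(c, k)
    for c in CHANNELS
    for k in (KIND_MAIN, KIND_DEBUGINFO, KIND_SRPMS)
)
```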
```
We pull primary/filelists/other for the package universe AND every
auxiliary record (updateinfo, group, modules, ...) so phase 6 can
copy non-package metadata through to routed destinations.
```

```
    Returns the path to the dir containing ``repodata/``, or None if
    the repo's ``repomd.xml`` returned 404 and *repo* was prefix-derived
    (silent skip). Other HTTP errors and explicit-origin 404s raise.
    """
```
```python
def probe_repo(probe_url: str, *, timeout: float = PROBE_TIMEOUT) -> tuple[str, str | None]:
    """HEAD ``<probe_url>/repodata/repomd.xml``.

    Returns ``(_PROBE_OK, None)`` on 2xx (or successful non-HTTP
    responses such as ``file://``), ``(_PROBE_MISSING, None)`` on 404,
    and ``(_PROBE_FAIL, "...")`` on any other transport error or
    non-2xx HTTP status. The error string is suitable for inclusion in
    a fatal-error message so the user can see the underlying cause.
    """
    url = f"{probe_url.rstrip('/')}/repodata/repomd.xml"
    req = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": USER_AGENT}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            # ``status`` is the HTTP status code for http(s); for
            # ``file://`` and other non-HTTP schemes urllib's response
            # has no status attribute -- a successful urlopen there
            # already proved the file exists.
            status = getattr(resp, "status", None)
            if status is None or 200 <= status < 300:
                return _PROBE_OK, None
            return _PROBE_FAIL, f"HTTP {status}"
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return _PROBE_MISSING, None
        return _PROBE_FAIL, f"HTTP {e.code}"
    except urllib.error.URLError as e:
        # urllib wraps a `file://` ENOENT as URLError(FileNotFoundError);
        # treat that as MISSING so local fixtures behave like the HTTP 404
        # case.
        if isinstance(e.reason, FileNotFoundError):
            return _PROBE_MISSING, None
        return _PROBE_FAIL, f"URL error: {e.reason}"
    except TimeoutError:
        return _PROBE_FAIL, f"timed out after {timeout:.0f}s"
    except OSError as e:
        return _PROBE_FAIL, f"OS error: {e}"
```
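The `URLError(FileNotFoundError)` behavior the comment relies on can be checked in isolation. This standalone sketch (the name `classify` is mine, not the script's helper) reproduces the tri-state classification for `file://` URLs:

```python
import urllib.error
import urllib.request

def classify(url: str) -> str:
    """Illustrative tri-state probe: 'ok', 'missing', or 'fail'."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10):
            # Success covers http(s) 2xx and readable file:// URLs.
            return "ok"
    except urllib.error.HTTPError as e:
        return "missing" if e.code == 404 else "fail"
    except urllib.error.URLError as e:
        # urllib wraps a file:// ENOENT as URLError(FileNotFoundError),
        # so a missing local fixture classifies like an HTTP 404.
        return "missing" if isinstance(e.reason, FileNotFoundError) else "fail"
    except OSError:
        return "fail"

if __name__ == "__main__":
    import pathlib, tempfile
    with tempfile.TemporaryDirectory() as d:
        fixture = pathlib.Path(d, "repomd.xml")
        fixture.write_text("<repomd/>")
        print(classify(fixture.as_uri()))                      # ok
        print(classify(pathlib.Path(d, "gone.xml").as_uri()))  # missing
```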
```python
if found_here == 0 and not failures:
    log(f"{PROG}: warning: no repos discovered under {prefix_trim}")
total_found += found_here
```
```python
    else:
        # No $basearch: caller is asserting "this URL is for one specific
        # arch". We can't tell which from the URL alone, so we infer from the
        # last path component if it matches a known arch; otherwise refuse.
        # Strip query/fragment first so signed URLs (`...?sig=...`) don't
        # poison the inference.
        parts = urllib.parse.urlsplit(url)
        path = parts.path.rstrip("/")
        last = path.rsplit("/", 1)[-1] if path else ""
        if last in arches:
            out.append(InputRepo(kind, last, url.rstrip("/"), "explicit"))
        else:
            raise ValueError(
                f"--repo {spec!r}: URL has no `$basearch` and its final path "
                f"component {last!r} is not a known arch ({', '.join(arches)}); "
                f"cannot determine arch"
            )
    return out
```
```python
    for record in repomd.records:
        # Only fetch the records we'll actually consume (primary,
        # filelists, other, plus their _db variants). See
        # PACKAGE_RECORD_TYPES above for why we skip aux records.
        if record.type not in PACKAGE_RECORD_TYPES:
            continue
        href = record.location_href or ""
        if not href:
            continue
        url = urllib.parse.urljoin(base, href)
        # Constrain the cache destination path so a hostile/malformed
        # repomd can't write outside cache_dir.
        safe_rel = href.lstrip("/")
        if ".." in Path(safe_rel).parts:
            raise RuntimeError(
                f"refusing to write metadata record outside cache: {href!r}"
            )
        dest = cache_dir / safe_rel
        log(f"  fetching {url}")
        _http_get(url, dest, ssl_context)
    return cache_dir
```
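The traversal guard in that loop is easy to demonstrate standalone; `is_safe` here is an illustrative wrapper around the same check, not a function from the script:

```python
from pathlib import Path

def is_safe(href: str) -> bool:
    # Mirrors the guard above: reject any href whose path components
    # contain "..", so a hostile repomd record can't escape the cache dir.
    return ".." not in Path(href.lstrip("/")).parts

print(is_safe("repodata/primary.xml.gz"))  # True
print(is_safe("../../etc/passwd"))         # False
```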
```python
    routing = query_azldev(
        args.repo_root, src_map, output_dir, known_components
    )
```
```python
        f"Arch to expand `$basearch` into (default: "
        f"{', '.join(DEFAULT_ARCHES)}). Repeatable."
```
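To illustrate what that flag controls, a minimal sketch of `$basearch` expansion; `expand_basearch` is a hypothetical helper, not the script's actual implementation:

```python
def expand_basearch(url: str, arches: list[str]) -> list[str]:
    # Hypothetical helper: when the URL contains the literal `$basearch`
    # token, produce one concrete URL per requested arch; otherwise the
    # URL already names a single arch and passes through unchanged.
    if "$basearch" not in url:
        return [url]
    return [url.replace("$basearch", arch) for arch in arches]
```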
binujp
left a comment
Thank you! Given this already caught issues which would otherwise have led to much head-scratching, we should just merge it. This is functional enough to iterate on based on usage feedback.
As I said in my comment, a couple of example invocations would be great.
I thought we were trying to keep tooling out of our source repo?